
feat: Add support for defining and using specific columns for metadata filtering. #19

Merged
anlowee merged 19 commits into y-scope:release-0.293-clp-connector from anlowee:xwei/metadata-filter-v2
Jul 7, 2025

anlowee commented Jun 24, 2025

Description

The CLP metadata MySQL database already supports filtering on some metadata, such as timestamps. This is similar to our earlier approach of using a date-sorted directory structure to filter archives at the top level, and it can significantly reduce the number of splits to scan.

Benefits:

  • Reduces the number of splits to scan, which improves query performance.

Trade-offs:

  • Minor performance cost of generating the metadata SQL while generating KQL.
  • Minor performance cost of string replacement when mapping to custom metadata columns.

To implement this, we added a new config property and designed a JSON config file format for configuring metadata filters. For example:

{
  "clp": [
      {
        "columnName": "level"
      }
  ],
  "clp.default": [
      {
        "columnName": "author"
      }
  ],
  "clp.default.table_1": [
      {
        "columnName": "msg.timestamp",
        "rangeMapping": {
          "lowerBound": "begin_timestamp",
          "upperBound": "end_timestamp"
        },
        "required": true
      },
      {
        "columnName": "file_name"
      }
  ]
}

Explanation:

  • "clp": Adds a filter on the level column for all schemas and tables under the clp catalog.
  • "clp.default": Adds a filter on the author column for all tables under the clp.default schema.
  • "clp.default.table_1": Adds two filters for the table clp.default.table_1:
    • msg.timestamp is remapped via rangeMapping and is marked as required.
    • file_name is used as-is without remapping.
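
The scoping rules above can be illustrated with a minimal Python sketch. It is not the connector's implementation (which is Java): the helper names `filters_for_table` and `required_columns` are hypothetical, and both the catalog-then-schema-then-table ordering and the assumption that `required` defaults to false are illustrative choices.

```python
import json

# The example config from this PR description, loaded as a plain dict.
CONFIG = json.loads("""
{
  "clp": [{"columnName": "level"}],
  "clp.default": [{"columnName": "author"}],
  "clp.default.table_1": [
    {
      "columnName": "msg.timestamp",
      "rangeMapping": {
        "lowerBound": "begin_timestamp",
        "upperBound": "end_timestamp"
      },
      "required": true
    },
    {"columnName": "file_name"}
  ]
}
""")


def filters_for_table(config, catalog, schema, table):
    """Collect the filters that apply to one table (hypothetical helper).

    Filters declared at the catalog and schema scopes also apply to the
    table, so all three scopes are merged, broadest scope first.
    """
    scopes = [catalog, f"{catalog}.{schema}", f"{catalog}.{schema}.{table}"]
    collected = []
    for scope in scopes:
        collected.extend(config.get(scope, []))
    return collected


def required_columns(filters):
    """Names of filter columns marked as required (assumed default: false)."""
    return [f["columnName"] for f in filters if f.get("required", False)]


table_1_filters = filters_for_table(CONFIG, "clp", "default", "table_1")
print([f["columnName"] for f in table_1_filters])
print(required_columns(table_1_filters))
```

For clp.default.table_1 this collects level, author, msg.timestamp, and file_name, with msg.timestamp as the only required column. A table in a different schema of the clp catalog would pick up only level.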

For more details about this config file, please refer to the modified clp.rst doc.

We also modified ClpExpression and ClpFilterToKqlConverter to extract the metadata filter SQL query while generating the KQL pushdown.
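
As a rough illustration of what extracting a metadata SQL filter could look like, here is a small Python sketch. It rests on several assumptions: the real logic lives in the Java classes named above and, per the trade-offs listed earlier, relies on string replacement rather than the tuple-based predicates used here. The functions `remap_predicate` and `build_metadata_sql` are hypothetical, as are the remapping rules (a lower-bound comparison is checked against the `upperBound` column, an upper-bound comparison against the `lowerBound` column, and equality against both) and the error raised when a required filter is missing.

```python
# Filters that apply to clp.default.table_1 in the example config.
FILTERS = [
    {
        "columnName": "msg.timestamp",
        "rangeMapping": {
            "lowerBound": "begin_timestamp",
            "upperBound": "end_timestamp",
        },
        "required": True,
    },
    {"columnName": "file_name"},
]


def sql_literal(value):
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def remap_predicate(column, op, value, filters):
    """Return a metadata SQL fragment, or None if the predicate has none."""
    spec = next((f for f in filters if f["columnName"] == column), None)
    if spec is None:
        return None  # Not a metadata filter column; stays in KQL only.
    literal = sql_literal(value)
    mapping = spec.get("rangeMapping")
    if mapping is None:
        return f"{column} {op} {literal}"  # Column is used as-is.
    lower, upper = mapping["lowerBound"], mapping["upperBound"]
    # Assumed semantics: keep every archive whose [lower, upper] range
    # could contain a matching value.
    if op in (">", ">="):
        return f"{upper} {op} {literal}"
    if op in ("<", "<="):
        return f"{lower} {op} {literal}"
    if op == "=":
        return f"({lower} <= {literal} AND {upper} >= {literal})"
    return None  # Other operators are not remapped in this sketch.


def build_metadata_sql(predicates, filters):
    """AND together the metadata fragments of (column, op, value) triples."""
    used = {column for column, _, _ in predicates}
    missing = [
        f["columnName"]
        for f in filters
        if f.get("required", False) and f["columnName"] not in used
    ]
    if missing:
        raise ValueError(f"missing required metadata filters: {missing}")
    fragments = [remap_predicate(c, o, v, filters) for c, o, v in predicates]
    return " AND ".join(f for f in fragments if f is not None)


sql = build_metadata_sql(
    [
        ("msg.timestamp", ">=", 1000),
        ("msg.timestamp", "<", 2000),
        ("file_name", "=", "app.log"),
        ("msg.logLevel", "=", "INFO"),  # Not configured, so KQL only.
    ],
    FILTERS,
)
print(sql)
# end_timestamp >= 1000 AND begin_timestamp < 2000 AND file_name = 'app.log'
```

In this sketch the predicate on msg.logLevel contributes nothing to the metadata SQL because no filter is configured for it, and a query with no predicate on msg.timestamp is rejected because that filter is marked as required.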

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ran an end-to-end test with sample logs in the following format:

"msg": {
    "timestamp": "1234",
    "message": "hahaha",
    "logLevel": "INFO"
}

Added a new unit test class, TestClpMetadataFilterConfig, and a few unit tests to the existing classes TestClpFilterToKql and TestClpSplit.

Summary by CodeRabbit

  • New Features

    • Introduced metadata-based filtering in the CLP connector with configurable JSON filters at catalog, schema, and table levels.
    • Added a configuration property to specify the metadata filter config file path.
    • Enabled required filters and range mappings for metadata columns to enforce query constraints and optimize filtering.
    • Enhanced query optimization to integrate metadata SQL filters alongside KQL pushdown expressions.
  • Documentation

    • Added detailed documentation on metadata filter configuration, structure, usage, and examples.
  • Bug Fixes

    • Improved type mapping by treating date strings as timestamps.
  • Tests

    • Added tests validating metadata filter configuration, SQL remapping, and metadata-based split filtering.
    • Refactored existing tests for filter pushdown and added coverage for metadata SQL generation.
  • Chores

    • Enhanced logging for query construction and split listing to aid debugging and traceability.

